Here we walk through an end-to-end gene-level RNA-Seq differential expression workflow using command-line tools and Bioconductor packages. We will start from the FASTQ files, show how to align these to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.
There are at least four distinct phases of analysis (Figure 1.1). The first phase takes the raw sequence reads generated by a sequencing platform and maps them to the genome or transcriptome. The second phase quantifies the number of reads associated with each gene or transcript, producing an expression matrix. These steps may involve distinct sub-stages of alignment, assembly and quantification, or a single tool may generate the expression matrix from the reads holistically in one step. There is usually a third phase in which the expression matrix is filtered to remove lowly expressed features, along with the crucial step of normalizing the raw counts to account for technical differences between the samples. The final phase of differential gene expression (DGE) analysis is statistical modelling of the sample groups and covariates, to calculate confidence statistics related to differential expression.
Figure 1.1: RNA-seq data analysis workflow for differential gene expression
In this tutorial we will follow workflow A: aligners such as TopHat, STAR or HISAT2 use a reference genome to map reads to genomic locations, and quantification tools such as HTSeq and featureCounts then assign those reads to features. After normalization, usually performed by methods built into the quantification or expression-modelling tools, such as the trimmed mean of M-values (TMM) method, gene expression is modelled using tools such as edgeR, DESeq2 and limma+voom, and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
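The idea behind library-size normalization can be illustrated with a toy counts-per-million (CPM) calculation. This is only a sketch with invented gene counts, not part of the tutorial data; real analyses should rely on the normalization built into edgeR or DESeq2, and TMM additionally corrects for composition bias beyond simple library-size scaling.

```shell
# Toy CPM scaling for one sample: divide each gene's count by the
# library total and multiply by one million. Counts are made up.
printf 'geneA\t100\ngeneB\t900\n' | awk '
  { gene[NR]=$1; count[NR]=$2; total+=$2 }
  END { for (i=1; i<=NR; i++) printf "%s\t%.0f\n", gene[i], count[i]/total*1e6 }'
# geneA   100000
# geneB   900000
```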
Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology. Bioconductor is based primarily on the statistical R programming language, but does contain contributions in other programming languages. It has two releases each year that follow the semiannual releases of R. At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs.
Bioconductor has many packages which support analysis of high-throughput sequence data, including RNA sequencing (RNA-seq). The packages which we will cover in this tutorial include core packages maintained by the Bioconductor core team for importing and processing raw sequencing data and loading gene annotations. We will also cover contributed packages for statistical analysis and visualization of sequencing data. Through scheduled releases every 6 months, the Bioconductor project ensures that all the packages within a release will work together in harmony (hence the “conductor” metaphor).
The packages used in this tutorial are loaded with the library function and can be installed by following the Bioconductor package installation instructions or using the conda package manager. A list of all the packages loaded in this tutorial is included at the end, in the Session information section.
The commands used in the tutorial should all be run from the Unix shell prompt within a terminal window up to the differential expression analysis, which is performed in the R environment. We encourage users to create a single directory (e.g., ‘tutorial’) in which to store all example data and files created by the analysis. The tutorial also includes small sections of code to be run in the R statistical computing environment. Commands meant to be executed from the Unix shell (e.g., bash or csh) have the chunk title ‘BASH’. Commands that are meant to be run from either an R script or at the R interactive shell have the chunk title ‘R’.
Create a tutorial directory to store all data and output files:
mkdir tutorial
The tutorial is illustrated with an example experiment from chromosome 22 of Homo sapiens that you can use to familiarize yourself with the analysis of RNA-seq data. All of the data you will need are available in the file griffith.tar.gz, which you will have to download. The data consists of two commercially available RNA samples: Universal Human Reference (UHR) and Human Brain Reference (HBR). The UHR is total RNA isolated from a diverse set of 10 cancer cell lines. The HBR is total RNA isolated from the brains of 23 Caucasians, male and female, of varying age but mostly 60-80 years old.
Download the data archive file into the tutorial directory.
curl https://media.githubusercontent.com/media/zifornd/bioinformatics-workshop/main/data/rnaseq/griffith.tar.gz --output tutorial/griffith.tar.gz
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
## 100  125M  100  125M    0     0  1934k      0  0:01:06  0:01:06 --:--:-- 2149k
Extract the data archive file into the tutorial directory.
tar xf tutorial/griffith.tar.gz --directory=tutorial
Inspect the contents of the tutorial data.
ls tutorial/data
## conda.txt
## genes
## genome
## samples
## samples.tsv
## targets.txt
The data archive file expands into a single directory, data, which contains three subdirectories:
ls tutorial/data/samples
## HBR_Rep1_1.fastq.gz
## HBR_Rep1_2.fastq.gz
## HBR_Rep2_1.fastq.gz
## HBR_Rep2_2.fastq.gz
## HBR_Rep3_1.fastq.gz
## HBR_Rep3_2.fastq.gz
## UHR_Rep1_1.fastq.gz
## UHR_Rep1_2.fastq.gz
## UHR_Rep2_1.fastq.gz
## UHR_Rep2_2.fastq.gz
## UHR_Rep3_1.fastq.gz
## UHR_Rep3_2.fastq.gz
ls tutorial/data/genome
## chr22.fa
ls tutorial/data/genes
## chr22.gtf
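Each sample has two FASTQ files because the reads are paired-end: the _1 and _2 suffixes hold the first and second mates of each fragment. A FASTQ record always occupies exactly four lines (header, sequence, separator, quality string), so reads can be counted by dividing the line total by four. The snippet below demonstrates this on a tiny made-up file rather than the (gzipped) tutorial data.

```shell
# Toy example: write a two-record FASTQ file to show the four-line
# record layout (header, sequence, separator, quality string).
printf '@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGGCCAA\n+\nFFFFFFFF\n' > toy.fastq

# Each record is exactly four lines, so reads = lines / 4.
echo $(( $(wc -l < toy.fastq) / 4 ))
# 2
```

For the gzipped tutorial files the same count would use zcat (or gzcat) in place of the plain file read.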
The data directory also contains three text files:
cat tutorial/data/conda.txt
## bioconductor-deseq2
## bioconductor-rsubread
## bioconductor-tximport
## gffread
## hisat2
## kallisto
## samtools
cat tutorial/data/targets.txt
## HBR_Rep1
## HBR_Rep2
## HBR_Rep3
## UHR_Rep1
## UHR_Rep2
## UHR_Rep3
cat tutorial/data/samples.tsv
## sample condition
## HBR_Rep1 HBR
## HBR_Rep2 HBR
## HBR_Rep3 HBR
## UHR_Rep1 UHR
## UHR_Rep2 UHR
## UHR_Rep3 UHR
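The samples.tsv table maps each sample to its condition, which the differential expression model will use later. As a quick sanity check of the design, the samples per condition can be tallied with awk; the table is recreated inline here so the snippet stands alone.

```shell
# Recreate the samples.tsv contents inline and count samples per
# condition, skipping the header row (NR > 1).
printf 'sample\tcondition\nHBR_Rep1\tHBR\nHBR_Rep2\tHBR\nHBR_Rep3\tHBR\nUHR_Rep1\tUHR\nUHR_Rep2\tUHR\nUHR_Rep3\tUHR\n' |
  awk 'NR > 1 { n[$2]++ } END { for (c in n) print c, n[c] }' | sort
# HBR 3
# UHR 3
```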
Normally you would create these files yourself in a text editor, but we provide them here for use as examples. Finally, create an output directory to store all the output files.
rm -rf tutorial/output
mkdir tutorial/output
All of the software required for this tutorial can be installed using the conda package manager. The data directory contains a text file called conda.txt which can be used to create a conda environment.
Create a new environment from the conda.txt file.
conda create --name rnaseq --yes --file tutorial/data/conda.txt
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
## Collecting package metadata (repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/rnaseq
##
## added / updated specs:
## - bioconductor-deseq2
## - bioconductor-rsubread
## - bioconductor-tximport
## - gffread
## - hisat2
## - kallisto
## - samtools
##
##
## The following NEW packages will be INSTALLED:
##
## _r-mutex conda-forge/noarch::_r-mutex-1.0.1-anacondar_1
## bioconductor-anno~ bioconda/noarch::bioconductor-annotate-1.72.0-r41hdfd78af_0
## bioconductor-anno~ bioconda/noarch::bioconductor-annotationdbi-1.56.1-r41hdfd78af_0
## bioconductor-biob~ bioconda/osx-64::bioconductor-biobase-2.54.0-r41h3be46a4_2
## bioconductor-bioc~ bioconda/noarch::bioconductor-biocgenerics-0.40.0-r41hdfd78af_0
## bioconductor-bioc~ bioconda/osx-64::bioconductor-biocparallel-1.28.3-r41hb890f52_1
## bioconductor-bios~ bioconda/osx-64::bioconductor-biostrings-2.62.0-r41h3be46a4_2
## bioconductor-dela~ bioconda/osx-64::bioconductor-delayedarray-0.20.0-r41h3be46a4_2
## bioconductor-dese~ bioconda/osx-64::bioconductor-deseq2-1.34.0-r41hb890f52_2
## bioconductor-gene~ bioconda/osx-64::bioconductor-genefilter-1.76.0-r41h238a2e4_2
## bioconductor-gene~ bioconda/noarch::bioconductor-geneplotter-1.72.0-r41hdfd78af_0
## bioconductor-geno~ bioconda/noarch::bioconductor-genomeinfodb-1.30.0-r41hdfd78af_0
## bioconductor-geno~ bioconda/noarch::bioconductor-genomeinfodbdata-1.2.7-r41hdfd78af_2
## bioconductor-geno~ bioconda/osx-64::bioconductor-genomicranges-1.46.1-r41h3be46a4_1
## bioconductor-iran~ bioconda/osx-64::bioconductor-iranges-2.28.0-r41h3be46a4_2
## bioconductor-kegg~ bioconda/noarch::bioconductor-keggrest-1.34.0-r41hdfd78af_0
## bioconductor-matr~ bioconda/noarch::bioconductor-matrixgenerics-1.6.0-r41hdfd78af_0
## bioconductor-rsub~ bioconda/osx-64::bioconductor-rsubread-2.8.1-r41haba8685_0
## bioconductor-s4ve~ bioconda/osx-64::bioconductor-s4vectors-0.32.4-r41h3be46a4_0
## bioconductor-summ~ bioconda/noarch::bioconductor-summarizedexperiment-1.24.0-r41hdfd78af_0
## bioconductor-txim~ bioconda/noarch::bioconductor-tximport-1.22.0-r41hdfd78af_0
## bioconductor-xvec~ bioconda/osx-64::bioconductor-xvector-0.34.0-r41h3be46a4_2
## bioconductor-zlib~ bioconda/osx-64::bioconductor-zlibbioc-1.40.0-r41h3be46a4_2
## bwidget conda-forge/osx-64::bwidget-1.9.14-h694c41f_1
## bzip2 conda-forge/osx-64::bzip2-1.0.8-h0d85af4_4
## c-ares conda-forge/osx-64::c-ares-1.18.1-h0d85af4_0
## ca-certificates conda-forge/osx-64::ca-certificates-2022.9.14-h033912b_0
## cairo conda-forge/osx-64::cairo-1.16.0-h904041c_1014
## cctools_osx-64 conda-forge/osx-64::cctools_osx-64-973.0.1-h2b95895_10
## clang conda-forge/osx-64::clang-14.0.4-h694c41f_0
## clang-14 conda-forge/osx-64::clang-14-14.0.4-default_h55ffa42_0
## clang_osx-64 conda-forge/osx-64::clang_osx-64-14.0.4-h3a95cd4_2
## clangxx conda-forge/osx-64::clangxx-14.0.4-default_h55ffa42_0
## clangxx_osx-64 conda-forge/osx-64::clangxx_osx-64-14.0.4-he1dbc44_2
## compiler-rt conda-forge/osx-64::compiler-rt-14.0.4-h7fcd477_0
## compiler-rt_osx-64 conda-forge/noarch::compiler-rt_osx-64-14.0.4-h6df654d_0
## curl conda-forge/osx-64::curl-7.83.1-h23f1065_0
## expat conda-forge/osx-64::expat-2.4.9-hf0c8a7f_0
## font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
## font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
## font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
## font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-hab24e00_0
## fontconfig conda-forge/osx-64::fontconfig-2.14.0-h5bb23bf_1
## fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
## fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
## freetype conda-forge/osx-64::freetype-2.12.1-h3f81eb7_0
## fribidi conda-forge/osx-64::fribidi-1.0.10-hbcb3906_0
## gettext conda-forge/osx-64::gettext-0.19.8.1-hd1a6beb_1008
## gffread bioconda/osx-64::gffread-0.12.7-h6151dfb_1
## gfortran_impl_osx~ conda-forge/osx-64::gfortran_impl_osx-64-9.5.0-h2221f41_25
## gfortran_osx-64 conda-forge/osx-64::gfortran_osx-64-9.5.0-h18f7dce_0
## gmp conda-forge/osx-64::gmp-6.2.1-h2e338ed_0
## graphite2 conda-forge/osx-64::graphite2-1.3.13-h2e338ed_1001
## gsl conda-forge/osx-64::gsl-2.7-h93259b0_0
## harfbuzz conda-forge/osx-64::harfbuzz-5.2.0-h52d05a2_0
## hdf5 conda-forge/osx-64::hdf5-1.12.1-nompi_h0aa1fa2_104
## hisat2 bioconda/osx-64::hisat2-2.2.1-h9722bc1_4
## htslib bioconda/osx-64::htslib-1.16-h567f53e_0
## icu conda-forge/osx-64::icu-70.1-h96cf925_0
## isl conda-forge/osx-64::isl-0.22.1-hb1e8313_2
## jpeg conda-forge/osx-64::jpeg-9e-hac89ed1_2
## kallisto bioconda/osx-64::kallisto-0.48.0-h0504637_2
## krb5 conda-forge/osx-64::krb5-1.19.3-hb98e516_0
## ld64_osx-64 conda-forge/osx-64::ld64_osx-64-609-h1e06c2b_10
## lerc conda-forge/osx-64::lerc-4.0.0-hb486fe8_0
## libblas conda-forge/osx-64::libblas-3.9.0-16_osx64_openblas
## libcblas conda-forge/osx-64::libcblas-3.9.0-16_osx64_openblas
## libclang-cpp14 conda-forge/osx-64::libclang-cpp14-14.0.4-default_h55ffa42_0
## libcurl conda-forge/osx-64::libcurl-7.83.1-h23f1065_0
## libcxx conda-forge/osx-64::libcxx-14.0.6-hccf4f1f_0
## libdeflate conda-forge/osx-64::libdeflate-1.13-h775f41a_0
## libedit conda-forge/osx-64::libedit-3.1.20191231-h0678c8f_2
## libev conda-forge/osx-64::libev-4.33-haf1e3a3_1
## libffi conda-forge/osx-64::libffi-3.4.2-h0d85af4_5
## libgfortran conda-forge/osx-64::libgfortran-5.0.0-10_4_0_h97931a8_25
## libgfortran-devel~ conda-forge/noarch::libgfortran-devel_osx-64-9.5.0-hc2d6858_25
## libgfortran5 conda-forge/osx-64::libgfortran5-11.3.0-h082f757_25
## libglib conda-forge/osx-64::libglib-2.72.1-hfbcb929_0
## libiconv conda-forge/osx-64::libiconv-1.16-haf1e3a3_0
## liblapack conda-forge/osx-64::liblapack-3.9.0-16_osx64_openblas
## libllvm14 conda-forge/osx-64::libllvm14-14.0.4-h41df66c_0
## libnghttp2 conda-forge/osx-64::libnghttp2-1.47.0-h5aae05b_1
## libopenblas conda-forge/osx-64::libopenblas-0.3.21-openmp_h429af6e_3
## libpng conda-forge/osx-64::libpng-1.6.38-ha978bb4_0
## libsqlite conda-forge/osx-64::libsqlite-3.39.3-ha978bb4_0
## libssh2 conda-forge/osx-64::libssh2-1.10.0-h47af595_3
## libtiff conda-forge/osx-64::libtiff-4.4.0-h5e0c7b4_3
## libwebp-base conda-forge/osx-64::libwebp-base-1.2.4-h775f41a_0
## libxml2 conda-forge/osx-64::libxml2-2.9.14-hea49891_4
## libzlib conda-forge/osx-64::libzlib-1.2.12-hfd90126_3
## llvm-openmp conda-forge/osx-64::llvm-openmp-14.0.4-ha654fa7_0
## llvm-tools conda-forge/osx-64::llvm-tools-14.0.4-h41df66c_0
## make conda-forge/osx-64::make-4.3-h22f3db7_1
## mpc conda-forge/osx-64::mpc-1.2.1-hbb51d92_0
## mpfr conda-forge/osx-64::mpfr-4.1.0-h0f52abe_1
## ncurses conda-forge/osx-64::ncurses-6.3-h96cf925_1
## openssl conda-forge/osx-64::openssl-3.0.5-hfd90126_2
## pango conda-forge/osx-64::pango-1.50.10-h1f197d0_0
## pcre conda-forge/osx-64::pcre-8.45-he49afe7_0
## pcre2 conda-forge/osx-64::pcre2-10.37-h3f55489_1
## perl conda-forge/osx-64::perl-5.32.1-2_h0d85af4_perl5
## pip conda-forge/noarch::pip-22.2.2-pyhd8ed1ab_0
## pixman conda-forge/osx-64::pixman-0.40.0-hbcb3906_0
## python conda-forge/osx-64::python-3.10.6-hc14f532_0_cpython
## r-askpass conda-forge/osx-64::r-askpass-1.1-r41h28b5c78_2
## r-backports conda-forge/osx-64::r-backports-1.4.1-r41h28b5c78_0
## r-base conda-forge/osx-64::r-base-4.1.3-h85fa70c_2
## r-bh conda-forge/noarch::r-bh-1.78.0_0-r41hc72bb7e_0
## r-bit conda-forge/osx-64::r-bit-4.0.4-r41h28b5c78_0
## r-bit64 conda-forge/osx-64::r-bit64-4.0.5-r41h28b5c78_0
## r-bitops conda-forge/osx-64::r-bitops-1.0_7-r41h67d6963_0
## r-blob conda-forge/noarch::r-blob-1.2.3-r41hc72bb7e_0
## r-brio conda-forge/osx-64::r-brio-1.1.3-r41h28b5c78_0
## r-cachem conda-forge/osx-64::r-cachem-1.0.6-r41h28b5c78_0
## r-callr conda-forge/noarch::r-callr-3.7.2-r41hc72bb7e_0
## r-cli conda-forge/osx-64::r-cli-3.4.1-r41h49197e3_0
## r-colorspace conda-forge/osx-64::r-colorspace-2.0_3-r41h0f1d5c4_0
## r-crayon conda-forge/noarch::r-crayon-1.5.1-r41hc72bb7e_0
## r-curl conda-forge/osx-64::r-curl-4.3.2-r41h28b5c78_0
## r-dbi conda-forge/noarch::r-dbi-1.1.3-r41hc72bb7e_0
## r-desc conda-forge/noarch::r-desc-1.4.2-r41hc72bb7e_0
## r-diffobj conda-forge/osx-64::r-diffobj-0.3.5-r41h28b5c78_0
## r-digest conda-forge/osx-64::r-digest-0.6.29-r41h9951f98_0
## r-ellipsis conda-forge/osx-64::r-ellipsis-0.3.2-r41h28b5c78_0
## r-evaluate conda-forge/noarch::r-evaluate-0.16-r41hc72bb7e_0
## r-fansi conda-forge/osx-64::r-fansi-1.0.3-r41h0f1d5c4_0
## r-farver conda-forge/osx-64::r-farver-2.1.1-r41h8619c4b_0
## r-fastmap conda-forge/osx-64::r-fastmap-1.1.0-r41h9951f98_0
## r-formatr conda-forge/noarch::r-formatr-1.12-r41hc72bb7e_0
## r-fs conda-forge/osx-64::r-fs-1.5.2-r41hc4bb905_1
## r-futile.logger conda-forge/noarch::r-futile.logger-1.4.3-r41hc72bb7e_1003
## r-futile.options conda-forge/noarch::r-futile.options-1.0.1-r41hc72bb7e_1002
## r-ggplot2 conda-forge/noarch::r-ggplot2-3.3.6-r41hc72bb7e_0
## r-glue conda-forge/osx-64::r-glue-1.6.2-r41h0f1d5c4_0
## r-gtable conda-forge/noarch::r-gtable-0.3.1-r41hc72bb7e_0
## r-httr conda-forge/noarch::r-httr-1.4.4-r41hc72bb7e_0
## r-isoband conda-forge/osx-64::r-isoband-0.2.5-r41h9951f98_0
## r-jsonlite conda-forge/osx-64::r-jsonlite-1.8.0-r41h0f1d5c4_0
## r-labeling conda-forge/noarch::r-labeling-0.4.2-r41hc72bb7e_1
## r-lambda.r conda-forge/noarch::r-lambda.r-1.2.4-r41hc72bb7e_1
## r-lattice conda-forge/osx-64::r-lattice-0.20_45-r41h28b5c78_0
## r-lifecycle conda-forge/noarch::r-lifecycle-1.0.2-r41hc72bb7e_0
## r-locfit conda-forge/osx-64::r-locfit-1.5_9.6-r41h67d6963_0
## r-magrittr conda-forge/osx-64::r-magrittr-2.0.3-r41h0f1d5c4_0
## r-mass conda-forge/osx-64::r-mass-7.3_58.1-r41hef1f586_0
## r-matrix conda-forge/osx-64::r-matrix-1.4_1-r41ha2825d1_0
## r-matrixstats conda-forge/osx-64::r-matrixstats-0.62.0-r41h0f1d5c4_0
## r-memoise conda-forge/noarch::r-memoise-2.0.1-r41hc72bb7e_0
## r-mgcv conda-forge/osx-64::r-mgcv-1.8_40-r41h60b693f_0
## r-mime conda-forge/osx-64::r-mime-0.12-r41h28b5c78_0
## r-munsell conda-forge/noarch::r-munsell-0.5.0-r41hc72bb7e_1004
## r-nlme conda-forge/osx-64::r-nlme-3.1_159-r41h225c1e1_0
## r-openssl conda-forge/osx-64::r-openssl-2.0.3-r41hfeb9312_0
## r-pillar conda-forge/noarch::r-pillar-1.8.1-r41hc72bb7e_0
## r-pkgconfig conda-forge/noarch::r-pkgconfig-2.0.3-r41hc72bb7e_1
## r-pkgload conda-forge/noarch::r-pkgload-1.3.0-r41hc72bb7e_0
## r-plogr conda-forge/noarch::r-plogr-0.2.0-r41hc72bb7e_1003
## r-png conda-forge/osx-64::r-png-0.1_7-r41h28b5c78_1004
## r-praise conda-forge/noarch::r-praise-1.0.0-r41hc72bb7e_1005
## r-processx conda-forge/osx-64::r-processx-3.7.0-r41h67d6963_0
## r-ps conda-forge/osx-64::r-ps-1.7.1-r41h67d6963_0
## r-r6 conda-forge/noarch::r-r6-2.5.1-r41hc72bb7e_0
## r-rcolorbrewer conda-forge/noarch::r-rcolorbrewer-1.1_3-r41h785f33e_0
## r-rcpp conda-forge/osx-64::r-rcpp-1.0.9-r41h49197e3_1
## r-rcpparmadillo conda-forge/osx-64::r-rcpparmadillo-0.11.2.3.1-r41hf5e6a41_0
## r-rcurl conda-forge/osx-64::r-rcurl-1.98_1.8-r41h67d6963_0
## r-rematch2 conda-forge/noarch::r-rematch2-2.1.2-r41hc72bb7e_1
## r-rlang conda-forge/osx-64::r-rlang-1.0.5-r41h49197e3_0
## r-rprojroot conda-forge/noarch::r-rprojroot-2.0.3-r41hc72bb7e_0
## r-rsqlite conda-forge/osx-64::r-rsqlite-2.2.8-r41h9951f98_0
## r-scales conda-forge/noarch::r-scales-1.2.1-r41hc72bb7e_0
## r-snow conda-forge/noarch::r-snow-0.4_4-r41hc72bb7e_0
## r-survival conda-forge/osx-64::r-survival-3.4_0-r41hef1f586_0
## r-sys conda-forge/osx-64::r-sys-3.4-r41h28b5c78_0
## r-testthat conda-forge/osx-64::r-testthat-3.1.4-r41h8619c4b_0
## r-tibble conda-forge/osx-64::r-tibble-3.1.8-r41h67d6963_0
## r-utf8 conda-forge/osx-64::r-utf8-1.2.2-r41h28b5c78_0
## r-vctrs conda-forge/osx-64::r-vctrs-0.4.1-r41hc4bb905_0
## r-viridislite conda-forge/noarch::r-viridislite-0.4.1-r41hc72bb7e_0
## r-waldo conda-forge/noarch::r-waldo-0.4.0-r41hc72bb7e_0
## r-withr conda-forge/noarch::r-withr-2.5.0-r41hc72bb7e_0
## r-xml conda-forge/osx-64::r-xml-3.99_0.10-r41h67d6963_0
## r-xtable conda-forge/noarch::r-xtable-1.8_4-r41hc72bb7e_3
## readline conda-forge/osx-64::readline-8.1.2-h3899abd_0
## samtools bioconda/osx-64::samtools-1.15.1-h7e39424_1
## setuptools conda-forge/noarch::setuptools-65.3.0-pyhd8ed1ab_1
## sigtool conda-forge/osx-64::sigtool-0.1.3-h88f4db0_0
## tapi conda-forge/osx-64::tapi-1100.0.11-h9ce4665_0
## tk conda-forge/osx-64::tk-8.6.12-h5dbffcc_0
## tktable conda-forge/osx-64::tktable-2.10-h49f0cf7_3
## tzdata conda-forge/noarch::tzdata-2022c-h191b570_0
## wheel conda-forge/noarch::wheel-0.37.1-pyhd8ed1ab_0
## xz conda-forge/osx-64::xz-5.2.6-h775f41a_0
## zlib conda-forge/osx-64::zlib-1.2.12-hfd90126_3
## zstd conda-forge/osx-64::zstd-1.5.2-hfa58983_4
##
##
## Preparing transaction: ...working... done
## Verifying transaction: ...working...
## SafetyError: The package for r-base located at /opt/miniconda3/pkgs/r-base-4.1.3-h85fa70c_2
## appears to be corrupted. The path 'lib/R/doc/html/packages.html'
## has an incorrect size.
## reported size: 3061 bytes
## actual size: 22896 bytes
##
##
## done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate rnaseq
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the new environment.
conda activate rnaseq
Verify that the new environment was created successfully.
conda env list
## # conda environments:
## #
## /Users/James/GitHub/array/.tests/integration/.snakemake/conda/19447c9d83fc6a16df8285b6af34bd39
## /Users/James/GitHub/array/.tests/integration/.snakemake/conda/e983188723b3b36090d186b4259c200b
## /Users/James/GitHub/array/.tests/integration/resources/bioconductor/annotation
## /Users/James/GitHub/array/.tests/integration/resources/bioconductor/organism
## /Users/James/GitHub/array/.tests/integration/resources/bioconductor/platform
## /Users/James/Library/Caches/org.R-project.R/R/basilisk/1.6.0/0
## /Users/James/Library/Caches/org.R-project.R/R/basilisk/1.6.0/velociraptor/1.4.0/env
## base * /opt/miniconda3
## DiasTailbudData /opt/miniconda3/envs/DiasTailbudData
## aspera-cli /opt/miniconda3/envs/aspera-cli
## bioconductor-rsconnect /opt/miniconda3/envs/bioconductor-rsconnect
## ffq /opt/miniconda3/envs/ffq
## genomepy /opt/miniconda3/envs/genomepy
## kb-python /opt/miniconda3/envs/kb-python
## openjdk /opt/miniconda3/envs/openjdk
## parallel /opt/miniconda3/envs/parallel
## rnaseq /opt/miniconda3/envs/rnaseq
## samtools /opt/miniconda3/envs/samtools
## seqtk /opt/miniconda3/envs/seqtk
## sevenbridges-python /opt/miniconda3/envs/sevenbridges-python
## snakemake /opt/miniconda3/envs/snakemake
## subread /opt/miniconda3/envs/subread
After sequencing has been completed, the starting point for analysis is the set of data files containing base-called sequencing reads, usually FASTQ files. The most common first step in processing these files is to map the sequence reads to a known transcriptome (or annotated genome), converting each sequence read to one or more genomic coordinates. This process has traditionally been accomplished using distinct alignment tools, such as TopHat, STAR or HISAT, which rely on a reference genome. Because the sequenced cDNA is derived from RNA, which may span exon boundaries, these tools perform a spliced alignment, allowing for gaps in the reads relative to the reference genome (which contains introns as well as exons). In this step of the workflow we will use HISAT2, the second major release of the HISAT splice-aware aligner.
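In the resulting alignments, a spliced read is recorded with an N operation in its CIGAR string, marking the intron the read jumps over on the reference. The sketch below walks a hypothetical CIGAR (50M200N50M, two 50 bp exonic matches separated by a 200 bp intron) to compute the reference span; the start coordinate is invented for illustration.

```shell
# Sketch: walk a CIGAR string and advance the reference coordinate for
# operations that consume reference bases (M, D, N). Values are invented.
echo "50M200N50M" | awk '
  { cigar=$0; pos=1000  # hypothetical 1-based start coordinate
    while (match(cigar, /^[0-9]+[MIDNSHP]/)) {
      op=substr(cigar, RLENGTH, 1); len=substr(cigar, 1, RLENGTH-1)+0
      if (op ~ /[MDN]/) pos+=len   # M, D and N consume reference bases
      cigar=substr(cigar, RLENGTH+1)
    }
    print "read spans reference positions 1000-" pos-1 }'
# read spans reference positions 1000-1299
```

So although the read itself is only 100 bp, its alignment covers 300 bp of the genome, which is exactly what a splice-aware aligner such as HISAT2 must be able to represent.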
Create a directory to store HISAT2 output files.
mkdir tutorial/output/hisat2
Build the HISAT2 index.
conda activate rnaseq
hisat2-build tutorial/data/genome/chr22.fa tutorial/output/hisat2/chr22.fa
## Settings:
## Output files: "tutorial/output/hisat2/chr22.fa.*.ht2"
## Line rate: 6 (line is 64 bytes)
## Lines per side: 1 (side is 64 bytes)
## Offset rate: 4 (one in 16)
## FTable chars: 10
## Strings: unpacked
## Local offset rate: 3 (one in 8)
## Local fTable chars: 6
## Local sequence length: 57344
## Local sequence overlap between two consecutive indexes: 1024
## Endianness: little
## Actual local endianness: little
## Sanity checking: disabled
## Assertions: disabled
## Random seed: 0
## Sizeofs: void*:8, int:4, long:8, size_t:8
## Input files DNA, FASTA:
## tutorial/data/genome/chr22.fa
## Reading reference sizes
## Time reading reference sizes: 00:00:00
## Calculating joined length
## Writing header
## Reserving space for joined string
## Joining reference sequences
## Time to join reference sequences: 00:00:01
## Time to read SNPs and splice sites: 00:00:00
## Using parameters --bmax 7342458 --dcv 1024
## Doing ahead-of-time memory usage test
## Passed! Constructing with these parameters: --bmax 7342458 --dcv 1024
## Constructing suffix-array element generator
## Building DifferenceCoverSample
## Building sPrime
## Building sPrimeOrder
## V-Sorting samples
## V-Sorting samples time: 00:00:00
## Allocating rank array
## Ranking v-sort output
## Ranking v-sort output time: 00:00:00
## Invoking Larsson-Sadakane on ranks
## Invoking Larsson-Sadakane on ranks time: 00:00:00
## Sanity-checking and returning
## Building samples
## Reserving space for 12 sample suffixes
## Generating random suffixes
## QSorting 12 sample offsets, eliminating duplicates
## QSorting sample offsets, eliminating duplicates time: 00:00:00
## Multikey QSorting 12 samples
## (Using difference cover)
## Multikey QSorting samples time: 00:00:00
## Calculating bucket sizes
## Splitting and merging
## Splitting and merging time: 00:00:00
## Avg bucket size: 4.89497e+06 (target: 7342457)
## Converting suffix-array elements to index image
## Allocating ftab, absorbFtab
## Entering GFM loop
## Getting block 1 of 8
## Reserving size (7342458) for bucket 1
## Calculating Z arrays for bucket 1
## Entering block accumulator loop for bucket 1:
## bucket 1: 10%
## bucket 1: 20%
## bucket 1: 30%
## bucket 1: 40%
## bucket 1: 50%
## bucket 1: 60%
## bucket 1: 70%
## bucket 1: 80%
## bucket 1: 90%
## bucket 1: 100%
## Sorting block of length 6109124 for bucket 1
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 6109125 for bucket 1
## Getting block 2 of 8
## Reserving size (7342458) for bucket 2
## Calculating Z arrays for bucket 2
## Entering block accumulator loop for bucket 2:
## bucket 2: 10%
## bucket 2: 20%
## bucket 2: 30%
## bucket 2: 40%
## bucket 2: 50%
## bucket 2: 60%
## bucket 2: 70%
## bucket 2: 80%
## bucket 2: 90%
## bucket 2: 100%
## Sorting block of length 4162165 for bucket 2
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 4162166 for bucket 2
## Getting block 3 of 8
## Reserving size (7342458) for bucket 3
## Calculating Z arrays for bucket 3
## Entering block accumulator loop for bucket 3:
## bucket 3: 10%
## bucket 3: 20%
## bucket 3: 30%
## bucket 3: 40%
## bucket 3: 50%
## bucket 3: 60%
## bucket 3: 70%
## bucket 3: 80%
## bucket 3: 90%
## bucket 3: 100%
## Sorting block of length 3242487 for bucket 3
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 3242488 for bucket 3
## Getting block 4 of 8
## Reserving size (7342458) for bucket 4
## Calculating Z arrays for bucket 4
## Entering block accumulator loop for bucket 4:
## bucket 4: 10%
## bucket 4: 20%
## bucket 4: 30%
## bucket 4: 40%
## bucket 4: 50%
## bucket 4: 60%
## bucket 4: 70%
## bucket 4: 80%
## bucket 4: 90%
## bucket 4: 100%
## Sorting block of length 5059508 for bucket 4
## (Using difference cover)
## Sorting block time: 00:00:00
## Returning block of 5059509 for bucket 4
## Getting block 5 of 8
## Reserving size (7342458) for bucket 5
## Calculating Z arrays for bucket 5
## Entering block accumulator loop for bucket 5:
## bucket 5: 10%
## bucket 5: 20%
## bucket 5: 30%
## bucket 5: 40%
## bucket 5: 50%
## bucket 5: 60%
## bucket 5: 70%
## bucket 5: 80%
## bucket 5: 90%
## bucket 5: 100%
## Sorting block of length 2717729 for bucket 5
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 2717730 for bucket 5
## Getting block 6 of 8
## Reserving size (7342458) for bucket 6
## Calculating Z arrays for bucket 6
## Entering block accumulator loop for bucket 6:
## bucket 6: 10%
## bucket 6: 20%
## bucket 6: 30%
## bucket 6: 40%
## bucket 6: 50%
## bucket 6: 60%
## bucket 6: 70%
## bucket 6: 80%
## bucket 6: 90%
## bucket 6: 100%
## Sorting block of length 6052533 for bucket 6
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 6052534 for bucket 6
## Getting block 7 of 8
## Reserving size (7342458) for bucket 7
## Calculating Z arrays for bucket 7
## Entering block accumulator loop for bucket 7:
## bucket 7: 10%
## bucket 7: 20%
## bucket 7: 30%
## bucket 7: 40%
## bucket 7: 50%
## bucket 7: 60%
## bucket 7: 70%
## bucket 7: 80%
## bucket 7: 90%
## bucket 7: 100%
## Sorting block of length 6696876 for bucket 7
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 6696877 for bucket 7
## Getting block 8 of 8
## Reserving size (7342458) for bucket 8
## Calculating Z arrays for bucket 8
## Entering block accumulator loop for bucket 8:
## bucket 8: 10%
## bucket 8: 20%
## bucket 8: 30%
## bucket 8: 40%
## bucket 8: 50%
## bucket 8: 60%
## bucket 8: 70%
## bucket 8: 80%
## bucket 8: 90%
## bucket 8: 100%
## Sorting block of length 5119348 for bucket 8
## (Using difference cover)
## Sorting block time: 00:00:01
## Returning block of 5119349 for bucket 8
## Exited GFM loop
## fchr[A]: 0
## fchr[C]: 10382214
## fchr[G]: 19542866
## fchr[T]: 28789052
## fchr[$]: 39159777
## Exiting GFM::buildToDisk()
## Returning from initFromVector
## Wrote 17248367 bytes to primary GFM file: tutorial/output/hisat2/chr22.fa.1.ht2
## Wrote 9789952 bytes to secondary GFM file: tutorial/output/hisat2/chr22.fa.2.ht2
## Re-opening _in1 and _in2 as input streams
## Returning from GFM constructor
## Returning from initFromVector
## Wrote 17349465 bytes to primary GFM file: tutorial/output/hisat2/chr22.fa.5.ht2
## Wrote 9969772 bytes to secondary GFM file: tutorial/output/hisat2/chr22.fa.6.ht2
## Re-opening _in5 and _in5 as input streams
## Returning from HGFM constructor
## Headers:
## len: 39159777
## gbwtLen: 39159778
## nodes: 39159778
## sz: 9789945
## gbwtSz: 9789945
## lineRate: 6
## offRate: 4
## offMask: 0xfffffff0
## ftabChars: 10
## eftabLen: 0
## eftabSz: 0
## ftabLen: 1048577
## ftabSz: 4194308
## offsLen: 2447487
## offsSz: 9789948
## lineSz: 64
## sideSz: 64
## sideGbwtSz: 48
## sideGbwtLen: 192
## numSides: 203958
## numLines: 203958
## gbwtTotLen: 13053312
## gbwtTotSz: 13053312
## reverse: 0
## linearFM: Yes
## Total time for call to driver() for forward index: 00:00:16
Map the reads for each sample to the reference genome.
conda activate rnaseq
while read SAMPLE; do
IDX=tutorial/output/hisat2/chr22.fa
FQ1=tutorial/data/samples/${SAMPLE}_1.fastq.gz
FQ2=tutorial/data/samples/${SAMPLE}_2.fastq.gz
SAM=tutorial/output/hisat2/${SAMPLE}.sam
hisat2 $IDX -1 $FQ1 -2 $FQ2 -S $SAM
done < tutorial/data/targets.txt
## 118571 reads; of these:
## 118571 (100.00%) were paired; of these:
## 56230 (47.42%) aligned concordantly 0 times
## 61769 (52.09%) aligned concordantly exactly 1 time
## 572 (0.48%) aligned concordantly >1 times
## ----
## 56230 pairs aligned concordantly 0 times; of these:
## 173 (0.31%) aligned discordantly 1 time
## ----
## 56057 pairs aligned 0 times concordantly or discordantly; of these:
## 112114 mates make up the pairs; of these:
## 112061 (99.95%) aligned 0 times
## 48 (0.04%) aligned exactly 1 time
## 5 (0.00%) aligned >1 times
## 52.75% overall alignment rate
## 144826 reads; of these:
## 144826 (100.00%) were paired; of these:
## 69045 (47.67%) aligned concordantly 0 times
## 74997 (51.78%) aligned concordantly exactly 1 time
## 784 (0.54%) aligned concordantly >1 times
## ----
## 69045 pairs aligned concordantly 0 times; of these:
## 241 (0.35%) aligned discordantly 1 time
## ----
## 68804 pairs aligned 0 times concordantly or discordantly; of these:
## 137608 mates make up the pairs; of these:
## 137563 (99.97%) aligned 0 times
## 43 (0.03%) aligned exactly 1 time
## 2 (0.00%) aligned >1 times
## 52.51% overall alignment rate
## 129786 reads; of these:
## 129786 (100.00%) were paired; of these:
## 61609 (47.47%) aligned concordantly 0 times
## 67528 (52.03%) aligned concordantly exactly 1 time
## 649 (0.50%) aligned concordantly >1 times
## ----
## 61609 pairs aligned concordantly 0 times; of these:
## 168 (0.27%) aligned discordantly 1 time
## ----
## 61441 pairs aligned 0 times concordantly or discordantly; of these:
## 122882 mates make up the pairs; of these:
## 122825 (99.95%) aligned 0 times
## 53 (0.04%) aligned exactly 1 time
## 4 (0.00%) aligned >1 times
## 52.68% overall alignment rate
## 227392 reads; of these:
## 227392 (100.00%) were paired; of these:
## 119012 (52.34%) aligned concordantly 0 times
## 105470 (46.38%) aligned concordantly exactly 1 time
## 2910 (1.28%) aligned concordantly >1 times
## ----
## 119012 pairs aligned concordantly 0 times; of these:
## 164 (0.14%) aligned discordantly 1 time
## ----
## 118848 pairs aligned 0 times concordantly or discordantly; of these:
## 237696 mates make up the pairs; of these:
## 236899 (99.66%) aligned 0 times
## 656 (0.28%) aligned exactly 1 time
## 141 (0.06%) aligned >1 times
## 47.91% overall alignment rate
## 162373 reads; of these:
## 162373 (100.00%) were paired; of these:
## 79555 (49.00%) aligned concordantly 0 times
## 80277 (49.44%) aligned concordantly exactly 1 time
## 2541 (1.56%) aligned concordantly >1 times
## ----
## 79555 pairs aligned concordantly 0 times; of these:
## 246 (0.31%) aligned discordantly 1 time
## ----
## 79309 pairs aligned 0 times concordantly or discordantly; of these:
## 158618 mates make up the pairs; of these:
## 158276 (99.78%) aligned 0 times
## 293 (0.18%) aligned exactly 1 time
## 49 (0.03%) aligned >1 times
## 51.26% overall alignment rate
## 185442 reads; of these:
## 185442 (100.00%) were paired; of these:
## 98885 (53.32%) aligned concordantly 0 times
## 84215 (45.41%) aligned concordantly exactly 1 time
## 2342 (1.26%) aligned concordantly >1 times
## ----
## 98885 pairs aligned concordantly 0 times; of these:
## 257 (0.26%) aligned discordantly 1 time
## ----
## 98628 pairs aligned 0 times concordantly or discordantly; of these:
## 197256 mates make up the pairs; of these:
## 196792 (99.76%) aligned 0 times
## 376 (0.19%) aligned exactly 1 time
## 88 (0.04%) aligned >1 times
## 46.94% overall alignment rate
Sort and convert the SAM files to BAM.
conda activate rnaseq
while read SAMPLE; do
SAM=tutorial/output/hisat2/${SAMPLE}.sam
BAM=tutorial/output/hisat2/${SAMPLE}.bam
samtools sort -o $BAM $SAM
done < tutorial/data/targets.txt
Index the BAM files.
conda activate rnaseq
while read SAMPLE; do
BAM=tutorial/output/hisat2/${SAMPLE}.bam
samtools index $BAM
done < tutorial/data/targets.txt
Note, the above commands could have been run in a single loop; however, for ease of understanding we have split them into separate chunks. This also helps when an error occurs and it is not obvious which command was responsible. For example, the samtools sort command may have thrown an error, but the real cause could have been an empty SAM file created by the aligner. The source of the problem would not have been immediately obvious if all of the commands were run together.
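The failure mode just described, an empty SAM file silently left behind by a failed alignment, can be caught explicitly before sorting. A minimal sketch in shell; the temporary file here is only a stand-in for a real per-sample SAM path:

```shell
# Guard against empty SAM files before handing them to samtools sort.
# The mktemp file below is a stand-in for a real tutorial/output/hisat2/*.sam.
SAM=$(mktemp)
: > "$SAM"   # simulate an empty file left behind by a failed alignment

if [ -s "$SAM" ]; then
    echo "would run: samtools sort on $SAM"
else
    echo "ERROR: $SAM is empty - re-check the hisat2 run for this sample" >&2
fi
rm -f "$SAM"
```

The `-s` test succeeds only when a file exists and is non-empty, which is exactly the condition a successful alignment should satisfy.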
Inspect the contents of the hisat2 output folder.
ls tutorial/output/hisat2
## HBR_Rep1.bam
## HBR_Rep1.bam.bai
## HBR_Rep1.sam
## HBR_Rep2.bam
## HBR_Rep2.bam.bai
## HBR_Rep2.sam
## HBR_Rep3.bam
## HBR_Rep3.bam.bai
## HBR_Rep3.sam
## UHR_Rep1.bam
## UHR_Rep1.bam.bai
## UHR_Rep1.sam
## UHR_Rep2.bam
## UHR_Rep2.bam.bai
## UHR_Rep2.sam
## UHR_Rep3.bam
## UHR_Rep3.bam.bai
## UHR_Rep3.sam
## chr22.fa.1.ht2
## chr22.fa.2.ht2
## chr22.fa.3.ht2
## chr22.fa.4.ht2
## chr22.fa.5.ht2
## chr22.fa.6.ht2
## chr22.fa.7.ht2
## chr22.fa.8.ht2
Once reads have been mapped to genomic locations, the next step in the analysis is to assign them to genes to determine abundance measures. The quantification of read abundances for individual genes (that is, all transcript isoforms for that gene) relies on counting sequence reads that overlap known genes, using a transcriptome annotation. Quantification tools in common use include RSEM, HTSeq and featureCounts. The results of the quantification step are usually combined into an expression matrix, with a row for each gene and a column for each sample, the values being raw read counts. In this step of the workflow we will use featureCounts, calling it from R via the Rsubread package.
First, we need to load the sample table into R. An example file called samples.tsv is included with the data files for this tutorial; in general, you will have to create this file yourself. It contains information about your RNA-seq samples in tsv (tab-separated values) format: each sample is described on one row of the file, and each column contains one variable. To read this file into R, we use the read.delim function.
samples <- read.delim("tutorial/data/samples.tsv")
samples
## sample condition
## 1 HBR_Rep1 HBR
## 2 HBR_Rep2 HBR
## 3 HBR_Rep3 HBR
## 4 UHR_Rep1 UHR
## 5 UHR_Rep2 UHR
## 6 UHR_Rep3 UHR
Construct the paths to the BAM files created by the read alignment step.
files <- file.path("tutorial/output/hisat2", paste0(samples$sample, ".bam"))
files
## [1] "tutorial/output/hisat2/HBR_Rep1.bam" "tutorial/output/hisat2/HBR_Rep2.bam" "tutorial/output/hisat2/HBR_Rep3.bam" "tutorial/output/hisat2/UHR_Rep1.bam"
## [5] "tutorial/output/hisat2/UHR_Rep2.bam" "tutorial/output/hisat2/UHR_Rep3.bam"
Load the Rsubread package which contains the featureCounts function.
library(Rsubread)
Inspect the documentation for the featureCounts function.
# Open the help page for featureCounts
help(featureCounts)
Assign mapped sequencing reads to genes.
output <- featureCounts(
files,
annot.ext = "tutorial/data/genes/chr22.gtf",
isGTFAnnotationFile = TRUE,
isPairedEnd = TRUE,
countReadPairs = TRUE
)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.10.5
##
## //========================== featureCounts setting ===========================\\
## || ||
## || Input files : 6 BAM files ||
## || ||
## || HBR_Rep1.bam ||
## || HBR_Rep2.bam ||
## || HBR_Rep3.bam ||
## || UHR_Rep1.bam ||
## || UHR_Rep2.bam ||
## || UHR_Rep3.bam ||
## || ||
## || Paired-end : yes ||
## || Count read pairs : yes ||
## || Annotation : chr22.gtf (GTF) ||
## || Dir for temp files : . ||
## || Threads : 1 ||
## || Level : meta-feature level ||
## || Multimapping reads : counted ||
## || Multi-overlapping reads : not counted ||
## || Min overlapping bases : 1 ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Load annotation file chr22.gtf ... ||
## || Features : 26087 ||
## || Meta-features : 1371 ||
## || Chromosomes/contigs : 1 ||
## || ||
## || Process BAM file HBR_Rep1.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 119475 ||
## || Successfully assigned alignments : 40229 (33.7%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file HBR_Rep2.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 146011 ||
## || Successfully assigned alignments : 49422 (33.8%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file HBR_Rep3.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 130797 ||
## || Successfully assigned alignments : 43689 (33.4%) ||
## || Running time : 0.00 minutes ||
## || ||
## || Process BAM file UHR_Rep1.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 231149 ||
## || Successfully assigned alignments : 71297 (30.8%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file UHR_Rep2.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 165729 ||
## || Successfully assigned alignments : 47412 (28.6%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Process BAM file UHR_Rep3.bam... ||
## || Paired-end reads are included. ||
## || Total alignments : 188584 ||
## || Successfully assigned alignments : 56818 (30.1%) ||
## || Running time : 0.01 minutes ||
## || ||
## || Write the final count table. ||
## || Write the read assignment summary. ||
## || ||
## \\============================================================================//
Inspect the first few rows of the count matrix.
head(output$counts)
## HBR_Rep1.bam HBR_Rep2.bam HBR_Rep3.bam UHR_Rep1.bam UHR_Rep2.bam UHR_Rep3.bam
## ENSG00000277248.1 0 0 0 0 0 0
## ENSG00000274237.1 0 0 0 0 0 0
## ENSG00000280363.1 0 0 0 0 0 0
## ENSG00000279973.1 0 0 0 0 0 0
## ENSG00000226444.2 0 0 0 0 1 0
## ENSG00000276871.1 0 0 0 0 0 0
In the count matrix, each row represents a gene, each column a sample, and the values give the raw numbers of fragments assigned to the respective gene in each sample.
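To make that structure concrete, here is a standalone toy count matrix (values invented for illustration, not taken from the real data) together with the two summaries most often computed from one:

```r
# Toy count matrix with the same layout as output$counts: rows are genes,
# columns are samples, values are raw fragment counts (made up here).
counts <- matrix(c(0, 12, 30,    # HBR_Rep1
                   0, 15, 28,    # HBR_Rep2
                   5,  0, 40,    # UHR_Rep1
                   6,  1, 35),   # UHR_Rep2
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("HBR_Rep1", "HBR_Rep2", "UHR_Rep1", "UHR_Rep2")))

colSums(counts)   # per-sample totals (library sizes of assigned fragments)
rowSums(counts)   # per-gene totals across all samples
```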
Once sequence reads have been processed into an expression matrix, the experiment can be modelled to determine which genes are likely to have changed their level of expression between conditions. Several tools are commonly used to accomplish this; some model read counts of gene-level expression, whereas others rely on transcript-level estimates. Gene-level tools typically rely on aligned read counts and use generalized linear models that enable complex experimental set-ups to be evaluated. These include tools such as edgeR, DESeq2 and limma+voom, which are computationally efficient and provide comparable results. In this step of the workflow we will use DESeq2, a package to perform differential gene expression analysis based on the negative binomial distribution.
Load the DESeq2 package.
library(DESeq2)
In this section we will show how to create the data object used by DESeq2. The object class used by the DESeq2 package to store the read counts and the intermediate estimated quantities during statistical analysis is the DESeqDataSet, which will usually be represented in the code here as an object dds.
A DESeqDataSet object must have an associated design formula. The design formula expresses the variables which will be used in modelling. The formula should be a tilde (~) followed by the variables with plus signs between them (it will be coerced into a formula if it is not already). The design can be changed later; however, all differential analysis steps must then be repeated, as the design formula is used to estimate the dispersions and the log2 fold changes of the model.
The simplest design formula for differential expression would be ~ condition, where condition is a column in the colData that specifies which of two (or more) groups each sample belongs to. For this experiment, we will specify ~ condition, meaning that we want to test for the effect of condition.
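The formula syntax itself is plain base R and can be inspected without DESeq2; a quick illustration:

```r
# A design formula is an ordinary R formula object.
design <- ~ condition
class(design)    # "formula"

# A character string is coerced to the same formula, as DESeq2 would do.
design2 <- as.formula("~ condition")
deparse(design) == deparse(design2)

# Additional covariates are joined with plus signs, e.g. ~ batch + condition.
```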
To construct a DESeqDataSet object from a matrix of counts and the sample information table, we use the DESeqDataSetFromMatrix function.
dds <- DESeqDataSetFromMatrix(countData = output$counts, colData = samples, design = ~ condition)
dds
## class: DESeqDataSet
## dim: 1371 6
## metadata(1): version
## assays(1): counts
## rownames(1371): ENSG00000277248.1 ENSG00000274237.1 ... ENSG00000184319.15 ENSG00000079974.17
## rowData names(0):
## colnames(6): HBR_Rep1.bam HBR_Rep2.bam ... UHR_Rep2.bam UHR_Rep3.bam
## colData names(2): sample condition
The DESeqDataSet class is built on top of the SummarizedExperiment class (Figure 3.1). The SummarizedExperiment class is used to store rectangular matrices of experimental results, such as those produced by sequencing and microarray experiments. Each object stores observations from one or more samples, along with additional metadata describing both the observations (features) and the samples (phenotypes). A key aspect of the SummarizedExperiment class is the coordination of the metadata and assays when subsetting. For example, if you want to exclude a given sample, you can do so for both the metadata and the assay in one operation, which ensures they remain in sync. For further information, refer to the SummarizedExperiment documentation.
Figure 3.1: Schematic of the SummarizedExperiment class
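To see why that coordination matters, here is the manual, error-prone alternative sketched with plain base R objects and toy data:

```r
# Without SummarizedExperiment, the assay matrix and the sample metadata must
# be subset separately, and it is easy to let them drift out of sync.
counts  <- matrix(1:12, nrow = 3,
                  dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))
samples <- data.frame(sample    = paste0("s", 1:4),
                      condition = c("HBR", "HBR", "UHR", "UHR"))

keep    <- samples$condition == "HBR"    # say we want to drop the UHR samples
counts  <- counts[, keep, drop = FALSE]  # one subset for the assay...
samples <- samples[keep, ]               # ...and a second for the metadata

# A SummarizedExperiment-derived object such as a DESeqDataSet does both in
# one coordinated operation, e.g. dds[, dds$condition == "HBR"].
```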
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the data object, and we increase the speed of the transformation and testing functions within DESeq2. Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
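The effect of the filter is easy to see on a standalone toy matrix (invented counts, separate from the dds object above):

```r
# Keep only genes with at least 10 reads in total across all samples.
counts <- matrix(c(0, 1,  0, 2,    # geneA: total  3 -> dropped
                   5, 8, 12, 4,    # geneB: total 29 -> kept
                   0, 0,  0, 0),   # geneC: total  0 -> dropped
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC"), NULL))

keep <- rowSums(counts) >= 10
counts[keep, , drop = FALSE]   # only geneB remains
```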
As we have already specified an experimental design when we created the DESeqDataSet, we can run the differential expression pipeline on the raw counts with a single call to the function DESeq.
dds <- DESeq(dds)
This function will print out a message for the various steps it performs. These are described in the manual page for DESeq, which can be accessed by typing ?DESeq. Complete details and motivations for the statistical procedures that occur within the DESeq function are described in the DESeq2 article. Briefly these steps are:
The estimation of size factors, controlling for differences in the counts due to the varying sequencing depth of the samples.
The estimation of dispersion values. The dispersion parameter captures how much the counts for the samples will vary around an expected value. Note that the expected value takes into consideration the sequencing depth and differences that can be attributed to variables in the design formula.
Fitting a final generalized linear model using the size factors and dispersion values estimated above, which gives estimates of the log fold changes.
A DESeqDataSet is returned that contains all the fitted parameters, and the following section describes how to extract results tables of interest from this object.
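The size-factor estimation in the first step uses the median-of-ratios method. A base R sketch on invented counts; DESeq2's estimateSizeFactors additionally handles zero counts and other edge cases:

```r
# Median-of-ratios size factors on a toy matrix: three genes, two samples,
# where sample 2 was sequenced at exactly twice the depth of sample 1.
counts <- matrix(c(10,  20,
                   100, 200,
                   30,  60),
                 nrow = 3, byrow = TRUE)

# Per-gene geometric means form a pseudo-reference sample.
geo_means <- exp(rowMeans(log(counts)))

# Each sample's size factor is the median ratio of its counts to the reference.
size_factors <- apply(counts, 2, function(cnts) median(cnts / geo_means))
size_factors                          # the second factor is twice the first

# Dividing each column by its size factor makes the samples comparable.
sweep(counts, 2, size_factors, "/")
```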
In general, the results for a comparison of any two levels of a variable can be extracted using the contrast argument to results. The user should specify three values: the name of the variable, the name of the level for the numerator, and the name of the level for the denominator. Here we extract results for the log2 fold change of one condition over the other.
res <- results(dds, contrast = c("condition", "HBR", "UHR"))
res
## log2 fold change (MLE): condition HBR vs UHR
## Wald test p-value: condition HBR vs UHR
## DataFrame with 596 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ENSG00000198062.14 4.28703 -4.28589 1.391617 -3.07979 2.07147e-03 5.58641e-03
## ENSG00000206195.10 78.27301 -4.10683 0.366725 -11.19864 4.14037e-29 8.22554e-28
## ENSG00000271127.1 6.21682 -4.85879 1.334809 -3.64007 2.72568e-04 8.46098e-04
## ENSG00000232775.6 11.02174 -3.72506 0.864055 -4.31114 1.62417e-05 5.72783e-05
## ENSG00000272872.1 25.78507 -4.40809 0.684453 -6.44031 1.19230e-10 6.76770e-10
## ... ... ... ... ... ... ...
## ENSG00000008735.13 285.4336 5.693925 0.267719 21.26829 2.23282e-100 2.66152e-98
## ENSG00000100299.17 56.2960 0.576911 0.260476 2.21483 2.67716e-02 5.37235e-02
## ENSG00000251322.7 245.8691 4.031104 0.196709 20.49276 2.49813e-93 2.12698e-91
## ENSG00000184319.15 58.1105 1.475279 0.271682 5.43016 5.63032e-08 2.64226e-07
## ENSG00000079974.17 60.3887 0.717441 0.262638 2.73167 6.30137e-03 1.49032e-02
As res is a DataFrame object, it carries metadata with information on the meaning of the columns.
mcols(res, use.names = TRUE)
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized c..
## log2FoldChange results log2 fold change (ML..
## lfcSE results standard error: cond..
## stat results Wald statistic: cond..
## pvalue results Wald test p-value: c..
## padj results BH adjusted p-values
The first column, baseMean, is just the average of the normalized count values (raw counts divided by the size factors), taken over all samples in the DESeqDataSet. The remaining five columns refer to a specific contrast, namely the comparison of the HBR level over the UHR level for the factor variable condition.
The column log2FoldChange is the effect size estimate. It tells us how much the gene’s expression seems to have changed between conditions. This value is reported on a logarithmic scale to base 2: for example, a log2 fold change of 1.5 means that the gene’s expression is increased by a multiplicative factor of 2^1.5 ≈ 2.82.
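The arithmetic is plain base R:

```r
# Converting a log2 fold change back to a plain fold change and vice versa.
log2fc <- 1.5
2^log2fc         # ~2.83: expression increased almost three-fold

log2(2.828427)   # ~1.5: the reverse direction

# Negative values mean lower expression in the numerator condition:
2^-1             # a log2 fold change of -1 is a halving of expression
```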
Of course, this estimate has an uncertainty associated with it, which is available in the column lfcSE, the standard error estimate for the log2 fold change. We can also express the uncertainty of a particular effect size estimate as the result of a statistical test. The purpose of a test for differential expression is to assess whether the data provide sufficient evidence to conclude that this value is really different from zero. DESeq2 performs a hypothesis test for each gene to decide against the null hypothesis that the treatment has zero effect on the gene and that the observed difference between treatment and control was caused merely by experimental variability (that is, the kind of variability you can expect between different samples in the same treatment group).
As usual in statistics, the result of this test is reported as a p value, and it is found in the column pvalue. A p value indicates the probability that a fold change as strong as the observed one, or even stronger, would be seen under the situation described by the null hypothesis. The adjusted p values satisfy the property that thresholding at a specific value defines a set of tests (one for each gene) with a bounded false discovery rate (FDR), typically a useful metric for assessing which genes to target for further analysis. For example, the set of genes with adjusted p value less than 0.1 should contain no more than 10% false positives.
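The same Benjamini-Hochberg adjustment that produces the padj column is available in base R as p.adjust, which makes the idea easy to demonstrate on invented p-values:

```r
# BH-adjust a handful of made-up p-values.
pvals <- c(0.0001, 0.003, 0.02, 0.04, 0.3, 0.9)
padj  <- p.adjust(pvals, method = "BH")
padj

# Thresholding the adjusted values controls the false discovery rate: among
# the genes with padj < 0.1 we expect no more than ~10% false positives.
sum(padj < 0.1)
```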
We can also summarize the results with the following line of code, which reports some additional information.
summary(res)
##
## out of 596 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 158, 27%
## LFC < 0 (down) : 167, 28%
## outliers [1] : 0, 0%
## low counts [2] : 0, 0%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
A quick way to visualize the counts for a particular gene is to use the plotCounts function that takes as arguments the DESeqDataSet, a gene name, and the group over which to plot the counts.
top <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = top, intgroup = "condition")
Figure 3.2: Normalized counts for a single gene over condition group.
Another plot we can generate is a so-called MA-plot, which provides a useful overview for an experiment with a two-group comparison.
plotMA(res)
Figure 3.3: An MA-plot of changes induced by condition.
The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis (“M” for minus, because a log ratio is equal to log minus log, and “A” for average). Each gene is represented with a dot. Genes with an adjusted p value below a threshold (here 0.1, the default) are shown in blue.
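The M and A quantities can be computed by hand for a single gene; here with invented mean normalized counts of 40 (HBR) and 10 (UHR), using the classic MA definitions:

```r
# "M" (minus) and "A" (average) for one gene, from toy normalized counts.
hbr <- 40
uhr <- 10

M <- log2(hbr) - log2(uhr)        # log ratio: log minus log = log2(40/10) = 2
A <- (log2(hbr) + log2(uhr)) / 2  # average of the log counts
c(M = M, A = A)

# (DESeq2's plotMA draws the mean of normalized counts on a log-scaled x-axis.)
```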
Our result table so far only contains information about Ensembl gene IDs, but alternative gene names may be more informative for collaborators. Bioconductor’s annotation packages help with mapping various ID schemes to each other. We load the AnnotationDbi package and the annotation package org.Hs.eg.db:
library(AnnotationDbi)
library(org.Hs.eg.db)
This is the organism annotation package (“org”) for Homo sapiens (“Hs”), organized as an AnnotationDbi database package (“db”), using Entrez Gene IDs (“eg”) as primary key. To get a list of all available key types, use:
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIPROT"
We can use the mapIds function to add individual columns to our results table. We provide the row names of our results table as keys, and specify keytype = "ENSEMBL". The column argument tells mapIds which information we want, and the multiVals argument tells the function what to do when there are multiple possible values for a single input value; here we ask it to return the first one that occurs in the database. To add the gene symbol and Entrez ID, we call mapIds twice.
# Strip the version suffix from the Ensembl IDs (e.g. ENSG00000198062.14
# becomes ENSG00000198062), since the org.Hs.eg.db keys are unversioned
ids <- strsplit(rownames(res), ".", fixed = TRUE)
ids <- sapply(ids, head, n = 1)
res$symbol <- mapIds(
x = org.Hs.eg.db,
keys = ids,
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first"
)
res$entrez <- mapIds(
x = org.Hs.eg.db,
keys = ids,
column = "ENTREZID",
keytype = "ENSEMBL",
multiVals = "first"
)
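The strsplit/sapply step above removes the Ensembl version suffix (the .14 in ENSG00000198062.14, for example) so the IDs match the unversioned ENSEMBL keys; a one-line sub call is equivalent:

```r
# Two equivalent ways to strip an Ensembl version suffix from a gene ID.
id <- "ENSG00000198062.14"

sub("\\..*$", "", id)                             # regex: drop from the first dot
sapply(strsplit(id, ".", fixed = TRUE), head, 1)  # the approach used above
```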
Now the results have the desired external gene IDs:
# Output the first few lines of the results table
head(res)
## log2 fold change (MLE): condition HBR vs UHR
## Wald test p-value: condition HBR vs UHR
## DataFrame with 6 rows and 8 columns
## baseMean log2FoldChange lfcSE stat pvalue padj symbol entrez
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <character> <character>
## ENSG00000198062.14 4.28703 -4.28589 1.391617 -3.07979 2.07147e-03 5.58641e-03 POTEH 23784
## ENSG00000206195.10 78.27301 -4.10683 0.366725 -11.19864 4.14037e-29 8.22554e-28 DUXAP8 503637
## ENSG00000271127.1 6.21682 -4.85879 1.334809 -3.64007 2.72568e-04 8.46098e-04 NA NA
## ENSG00000232775.6 11.02174 -3.72506 0.864055 -4.31114 1.62417e-05 5.72783e-05 BMS1P22 106480334
## ENSG00000272872.1 25.78507 -4.40809 0.684453 -6.44031 1.19230e-10 6.76770e-10 NA NA
## ENSG00000223875.2 1.82601 -4.08525 1.646005 -2.48192 1.30677e-02 2.92513e-02 NA NA
A plain-text file of the results can be exported using the base R functions write.csv or write.table. We suggest using a descriptive file name indicating the variable and the levels which were tested.
write.csv(as.data.frame(res), file = "tutorial/output/HBR-UHR.results.csv")
Exporting only the results which pass an adjusted p value threshold can be accomplished with the subset function, followed by the write.csv function.
sig <- subset(res, padj < 0.1)
write.csv(as.data.frame(sig), file = "tutorial/output/HBR-UHR.signif.csv")
As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and of all the packages used in this session. It is good practice to keep such a record, as it will help to track down what has happened if an R script stops working or gives different results because a function has changed in a newer version of one of your packages. Including it at the bottom of a script makes your reports more reproducible.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.15.0 AnnotationDbi_1.58.0 DESeq2_1.36.0 SummarizedExperiment_1.26.1 Biobase_2.56.0
## [6] MatrixGenerics_1.8.1 matrixStats_0.62.0 GenomicRanges_1.48.0 GenomeInfoDb_1.32.4 IRanges_2.30.1
## [11] S4Vectors_0.34.0 BiocGenerics_0.42.0 Rsubread_2.10.5 fontawesome_0.3.0.9000 captioner_2.2.3
## [16] bookdown_0.29 knitr_1.40 rmarkdown_2.16
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 RColorBrewer_1.1-3 httr_1.4.4 tools_4.2.1 bslib_0.4.0 utf8_1.2.2
## [8] R6_2.5.1 DBI_1.1.3 colorspace_2.0-3 tidyselect_1.1.2 bit_4.0.4 compiler_4.2.1 cli_3.4.0
## [15] DelayedArray_0.22.0 sass_0.4.2 scales_1.2.1 genefilter_1.78.0 stringr_1.4.1 digest_0.6.29 XVector_0.36.0
## [22] pkgconfig_2.0.3 htmltools_0.5.3 vembedr_0.1.5 fastmap_1.1.0 highr_0.9 rlang_1.0.5 rstudioapi_0.14
## [29] RSQLite_2.2.17 jquerylib_0.1.4 generics_0.1.3 jsonlite_1.8.0 BiocParallel_1.30.3 dplyr_1.0.10 RCurl_1.98-1.8
## [36] magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.5-1 Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.2
## [43] stringi_1.7.8 yaml_2.3.5 zlibbioc_1.42.0 grid_4.2.1 blob_1.2.3 parallel_4.2.1 crayon_1.5.1
## [50] lattice_0.20-45 Biostrings_2.64.1 splines_4.2.1 annotate_1.74.0 KEGGREST_1.36.3 locfit_1.5-9.6 pillar_1.8.1
## [57] geneplotter_1.74.0 codetools_0.2-18 XML_3.99-0.10 glue_1.6.2 evaluate_0.16 png_0.1-7 vctrs_0.4.1
## [64] gtable_0.3.1 purrr_0.3.4 assertthat_0.2.1 cachem_1.0.6 ggplot2_3.3.6 xfun_0.33 tufte_0.12
## [71] xtable_1.8-4 survival_3.4-0 tibble_3.1.8 memoise_2.0.1
This tutorial was produced from a number of different online tutorials and review articles. Most of the text has been duplicated verbatim or adapted to fit the tutorial objectives.